mean_squared_error (MSE)#

Mean squared error (MSE) is one of the most common regression metrics and training losses. It measures the average of the squared differences between targets and predictions.


Learning goals#

  • Understand the definition, units, and common variants (weights, multioutput)

  • Build intuition for why squaring emphasizes large errors (outliers)

  • Implement MSE from scratch in NumPy

  • Visualize the loss landscape for a simple model

  • Use MSE to fit a linear regression model via gradient descent

Prerequisites#

  • Basic NumPy arrays and broadcasting

  • Derivatives of polynomials (or comfort with gradients)

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.metrics import mean_squared_error

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) Definition#

Let \(y \in \mathbb{R}^n\) be targets and \(\hat{y} \in \mathbb{R}^n\) predictions.

\[ \mathrm{MSE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

Define the residual \(r_i = y_i - \hat{y}_i\); MSE is the mean of the squared residuals. In vector form:

\[ \mathrm{MSE}(y, \hat{y}) = \frac{1}{n}\lVert y - \hat{y} \rVert_2^2 \]

Weighted MSE#

If some samples matter more (or have different noise levels), use non-negative weights \(w_i\):

\[ \mathrm{WMSE}(y, \hat{y}; w) = \frac{\sum_{i=1}^n w_i (y_i-\hat{y}_i)^2}{\sum_{i=1}^n w_i} \]
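As a quick sanity check (with hypothetical toy numbers), equal weights recover the plain MSE, while upweighting one sample pulls the score toward that sample's squared error:

```python
import numpy as np

y = np.array([2.0, 0.0, 4.0, 1.0])
y_hat = np.array([1.5, 0.2, 3.0, 2.0])

sq = (y - y_hat) ** 2

# Equal weights: WMSE reduces to the plain MSE
w_equal = np.ones_like(y)
wmse_equal = np.sum(w_equal * sq) / np.sum(w_equal)

# Upweighting the last sample pulls the score toward its squared error
w_skewed = np.array([1.0, 1.0, 1.0, 10.0])
wmse_skewed = np.sum(w_skewed * sq) / np.sum(w_skewed)

print(wmse_equal)   # same value as np.mean(sq)
print(wmse_skewed)  # larger here, since the upweighted sample has the biggest error
```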

Multioutput#

For \(y,\hat{y} \in \mathbb{R}^{n\times d}\) (d targets per sample), compute per-output MSE:

\[ \mathrm{MSE}_j = \frac{1}{n}\sum_{i=1}^n (y_{ij}-\hat{y}_{ij})^2 \]

and then aggregate across outputs (uniform average or a weighted average).
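In NumPy, the per-output MSE is just a mean over the sample axis; a minimal sketch with made-up numbers:

```python
import numpy as np

# Toy multioutput targets: n=3 samples, d=2 outputs (hypothetical values)
Y = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
Y_hat = np.array([[1.5, 11.0],
                  [2.0, 19.0],
                  [2.0, 30.0]])

mse_per_output = np.mean((Y - Y_hat) ** 2, axis=0)  # shape (d,)
mse_uniform = mse_per_output.mean()                 # uniform average over outputs

print(mse_per_output)
print(mse_uniform)
```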

Units#

If \(y\) is measured in meters, then MSE is measured in meters². For interpretability, people often report \(\mathrm{RMSE}=\sqrt{\mathrm{MSE}}\) in meters.
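A tiny illustration of the units point (the values are invented):

```python
import numpy as np

# If y is in meters, MSE is in meters^2; RMSE is back in meters
y = np.array([10.0, 12.0, 9.0])      # meters
y_hat = np.array([11.0, 11.0, 9.5])  # meters

mse = np.mean((y - y_hat) ** 2)  # m^2
rmse = np.sqrt(mse)              # m

print(mse, rmse)
```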

y_true = np.array([2.0, 0.0, 4.0, 1.0])
y_pred = np.array([1.5, 0.2, 3.0, 2.0])

residual = y_true - y_pred
sq_error = residual**2
mse = float(np.mean(sq_error))

print("y_true   =", y_true)
print("y_pred   =", y_pred)
print("residual =", residual)
print("(residual)^2 =", sq_error)
print("MSE (NumPy)  =", mse)
print("MSE (sklearn)=", mean_squared_error(y_true, y_pred))
y_true   = [2. 0. 4. 1.]
y_pred   = [1.5 0.2 3.  2. ]
residual = [ 0.5 -0.2  1.  -1. ]
(residual)^2 = [0.25 0.04 1.   1.  ]
MSE (NumPy)  = 0.5725
MSE (sklearn)= 0.5725
idx = np.arange(len(y_true))

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Residuals (y - ŷ)", "Contribution to MSE: (y - ŷ)²"),
)

fig.add_trace(
    go.Scatter(x=idx, y=y_true, mode="markers", name="y_true", marker=dict(size=10)),
    row=1,
    col=1,
)
fig.add_trace(
    go.Scatter(x=idx, y=y_pred, mode="markers", name="y_pred", marker=dict(size=10)),
    row=1,
    col=1,
)

for i in idx:
    fig.add_trace(
        go.Scatter(
            x=[i, i],
            y=[y_pred[i], y_true[i]],
            mode="lines",
            line=dict(color="rgba(0,0,0,0.35)", width=2),
            showlegend=False,
        ),
        row=1,
        col=1,
    )

fig.add_trace(go.Bar(x=idx, y=sq_error, name="(y - ŷ)²"), row=1, col=2)

fig.update_xaxes(title_text="sample index", row=1, col=1)
fig.update_yaxes(title_text="value", row=1, col=1)
fig.update_xaxes(title_text="sample index", row=1, col=2)
fig.update_yaxes(title_text="squared error", row=1, col=2)

fig.update_layout(title=f"MSE = mean((y - ŷ)²) = {mse:.3f}", legend=dict(orientation="h"))
fig.show()

2) Why do we square the errors?#

For a single residual \(r\), the squared loss is:

\[ \ell(r) = r^2 \]

Key properties:

  • Emphasizes large errors: doubling \(|r|\) multiplies the penalty by \(4\).

  • Smooth and differentiable: useful for gradient-based optimization.

  • For many models (e.g. linear regression), the MSE objective is convex.

The derivative is simple:

\[ \ell'(r)=2r \]

For MSE, the derivative with respect to a prediction \(\hat{y}_i\) is:

\[ \frac{\partial}{\partial \hat{y}_i} \mathrm{MSE}(y, \hat{y}) = -\frac{2}{n}(y_i - \hat{y}_i) \]

So the gradient magnitude grows linearly with the residual: big mistakes produce big updates.
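The gradient formula is easy to verify numerically with central finite differences (a sketch on toy arrays):

```python
import numpy as np

y = np.array([2.0, 0.0, 4.0, 1.0])
y_hat = np.array([1.5, 0.2, 3.0, 2.0])
n = len(y)

def mse(y_hat):
    return np.mean((y - y_hat) ** 2)

# Analytic gradient: dMSE/dy_hat_i = -(2/n) * (y_i - y_hat_i)
grad_analytic = -(2.0 / n) * (y - y_hat)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.zeros(n)
for i in range(n):
    up, down = y_hat.copy(), y_hat.copy()
    up[i] += eps
    down[i] -= eps
    grad_fd[i] = (mse(up) - mse(down)) / (2 * eps)

print(np.allclose(grad_analytic, grad_fd, atol=1e-6))  # True
```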

A common alternative is MAE (mean absolute error), which is more robust to outliers but is not differentiable at \(r=0\).

r = np.linspace(-5, 5, 400)

fig = go.Figure()
fig.add_trace(go.Scatter(x=r, y=r**2, mode="lines", name="Squared loss r²"))
fig.add_trace(go.Scatter(x=r, y=np.abs(r), mode="lines", name="Absolute loss |r|"))

fig.update_layout(
    title="How loss grows with the residual r",
    xaxis_title="residual r",
    yaxis_title="loss",
)
fig.show()

3) Probabilistic view: why MSE is ‘natural’ for Gaussian noise#

Assume a regression model predicts the conditional mean and errors are Gaussian:

\[ Y_i = \hat{y}_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \]

The negative log-likelihood (up to constants that don’t depend on \(\hat{y}\)) is:

\[ -\log p(y\mid \hat{y}) \propto \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\hat{y}_i)^2 \]

So minimizing squared error is equivalent to maximum likelihood estimation under i.i.d. Gaussian noise.

If different points have different noise levels (heteroskedasticity), a weighted MSE (weighted least squares) can be more appropriate.
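The connection is direct: with weights \(w_i = 1/\sigma_i^2\), the \(\hat{y}\)-dependent part of the heteroskedastic Gaussian negative log-likelihood is exactly half the (unnormalized) weighted sum of squared residuals. A small sketch, assuming the per-point noise scales are known:

```python
import numpy as np

rng = np.random.default_rng(0)

# Heteroskedastic toy data: each point has its own (assumed known) noise scale
sigma = np.array([0.2, 0.2, 1.0, 2.0])
y_hat = np.array([1.0, 2.0, 3.0, 4.0])
y = y_hat + rng.normal(0, sigma)

r = y - y_hat
w = 1.0 / sigma**2

# The y_hat-dependent part of the Gaussian NLL...
nll_term = 0.5 * np.sum((r / sigma) ** 2)

# ...equals 0.5 * sum(w_i * r_i^2): an (unnormalized) weighted squared error
assert np.isclose(nll_term, 0.5 * np.sum(w * r**2))
print(nll_term)
```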

4) Mean squared error from scratch (NumPy)#

Below is a small implementation that mirrors the core ideas in sklearn.metrics.mean_squared_error:

  • works for 1D targets (n,) and multioutput targets (n, d)

  • optional sample_weight for weighted MSE

  • multioutput aggregation: 'raw_values', 'uniform_average', or an array of output weights

This is intentionally “low-level”: no pandas, no sklearn internals.

def mean_squared_error_np(
    y_true,
    y_pred,
    *,
    sample_weight=None,
    multioutput="uniform_average",
):
    '''Compute mean squared error (MSE) with optional sample weights and multioutput support.

    Parameters
    ----------
    y_true, y_pred:
        Array-like with shape (n,) or (n, d)
    sample_weight:
        None or array-like with shape (n,)
    multioutput:
        - 'raw_values'        -> return array of shape (d,)
        - 'uniform_average'   -> return scalar (mean over outputs)
        - array-like (d,)     -> weighted average over outputs

    Returns
    -------
    float or np.ndarray
    '''

    y_true_arr = np.asarray(y_true, dtype=float)
    y_pred_arr = np.asarray(y_pred, dtype=float)

    if y_true_arr.shape != y_pred_arr.shape:
        raise ValueError(f"Shape mismatch: y_true {y_true_arr.shape} vs y_pred {y_pred_arr.shape}")

    if y_true_arr.ndim == 1:
        y_true_arr = y_true_arr[:, None]
        y_pred_arr = y_pred_arr[:, None]
    elif y_true_arr.ndim != 2:
        raise ValueError("y_true/y_pred must be 1D or 2D")

    n_samples, n_outputs = y_true_arr.shape

    sq_error = (y_true_arr - y_pred_arr) ** 2  # (n, d)

    if sample_weight is not None:
        w = np.asarray(sample_weight, dtype=float)
        if w.ndim != 1 or w.shape[0] != n_samples:
            raise ValueError(f"sample_weight must have shape ({n_samples},), got {w.shape}")
        if np.any(w < 0):
            raise ValueError("sample_weight must be non-negative")

        denom = float(np.sum(w))
        if denom == 0.0:
            raise ValueError("sum(sample_weight) must be > 0")

        mse_per_output = np.sum(sq_error * w[:, None], axis=0) / denom
    else:
        mse_per_output = np.mean(sq_error, axis=0)

    if isinstance(multioutput, str):
        if multioutput == "raw_values":
            return mse_per_output
        if multioutput == "uniform_average":
            return float(np.mean(mse_per_output))
        raise ValueError(
            "multioutput must be 'raw_values', 'uniform_average', or array-like of output weights"
        )

    output_weights = np.asarray(multioutput, dtype=float)
    if output_weights.shape != (n_outputs,):
        raise ValueError(f"multioutput weights must have shape ({n_outputs},), got {output_weights.shape}")

    weight_sum = float(np.sum(output_weights))
    if weight_sum == 0.0:
        raise ValueError("sum(multioutput weights) must be > 0")

    return float(np.dot(mse_per_output, output_weights) / weight_sum)
# Sanity checks vs sklearn

# 1D
y_true_1d = rng.normal(size=50)
y_pred_1d = y_true_1d + rng.normal(scale=0.5, size=50)

print(
    "1D equal:",
    np.isclose(mean_squared_error_np(y_true_1d, y_pred_1d), mean_squared_error(y_true_1d, y_pred_1d)),
)

# Multioutput
y_true_2d = rng.normal(size=(50, 3))
y_pred_2d = y_true_2d + rng.normal(scale=0.5, size=(50, 3))

mse_raw_np = mean_squared_error_np(y_true_2d, y_pred_2d, multioutput="raw_values")
mse_raw_sk = mean_squared_error(y_true_2d, y_pred_2d, multioutput="raw_values")

print("multioutput raw close:", np.allclose(mse_raw_np, mse_raw_sk))

# Weighted
w = rng.uniform(0.1, 2.0, size=50)

mse_w_np = mean_squared_error_np(y_true_2d, y_pred_2d, sample_weight=w, multioutput="uniform_average")
mse_w_sk = mean_squared_error(y_true_2d, y_pred_2d, sample_weight=w, multioutput="uniform_average")

print("weighted uniform close:", np.isclose(mse_w_np, mse_w_sk))

# Weighted multioutput
out_w = np.array([0.2, 0.3, 0.5])

mse_wo_np = mean_squared_error_np(y_true_2d, y_pred_2d, sample_weight=w, multioutput=out_w)
mse_wo_sk = mean_squared_error(y_true_2d, y_pred_2d, sample_weight=w, multioutput=out_w)

print("weighted output close:", np.isclose(mse_wo_np, mse_wo_sk))
1D equal: True
multioutput raw close: True
weighted uniform close: True
weighted output close: True

5) Geometry intuition: constant predictions#

Suppose your model can only predict a single constant value \(a\) for every sample:

\[ \hat{y}_i = a \]

The MSE becomes a function of a single parameter:

\[ J(a) = \frac{1}{n}\sum_{i=1}^n (y_i-a)^2 \]

Differentiate and set to zero:

\[ \frac{dJ}{da} = -\frac{2}{n}\sum_{i=1}^n (y_i-a)=0 \quad\Rightarrow\quad a = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y} \]

So the constant that minimizes MSE is the mean.

The best constant MSE equals the variance#

At the optimum \(a=\bar{y}\), the minimum value is:

\[ J(\bar{y}) = \frac{1}{n}\sum_{i=1}^n (y_i-\bar{y})^2 = \mathrm{Var}(y) \quad (\mathrm{ddof}=0) \]

So predicting the mean gives an MSE equal to the (population) variance of the target.
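Both facts are easy to check on a grid of candidate constants (a sketch with synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=1.5, size=1000)

# Evaluate J(a) = mean((y - a)^2) on a dense grid of constants a
a_grid = np.linspace(0, 6, 2001)
mse_grid = np.mean((y[:, None] - a_grid[None, :]) ** 2, axis=0)

a_best = a_grid[np.argmin(mse_grid)]
print(a_best, y.mean())          # grid minimizer is (up to grid spacing) the sample mean
print(mse_grid.min(), y.var())   # minimum MSE matches Var(y) with ddof=0
```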

Weighted MSE: the weighted mean#

If we use weights \(w_i \ge 0\):

\[ J_w(a)=\frac{\sum_i w_i (y_i-a)^2}{\sum_i w_i} \]

then the minimizer is the weighted mean:

\[ a^* = \frac{\sum_i w_i y_i}{\sum_i w_i} \]

Interpreting \(w_i \propto 1/\sigma_i^2\) connects this to weighted least squares under heteroskedastic Gaussian noise.
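Since \(J_w\) is a quadratic with \(J_w(a^* + \delta) = J_w(a^*) + \delta^2\), nudging \(a\) away from the weighted mean can only increase the loss. A quick check on random data:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=200)
w = rng.uniform(0.1, 2.0, size=200)

a_star = np.sum(w * y) / np.sum(w)  # weighted mean

def wmse(a):
    return np.sum(w * (y - a) ** 2) / np.sum(w)

# Any nudge away from the weighted mean strictly increases the loss
for delta in (-0.1, -0.01, 0.01, 0.1):
    assert wmse(a_star + delta) > wmse(a_star)

print(a_star, wmse(a_star))
```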

This is also why MSE is sensitive to outliers: a single extreme value can move the mean (and therefore the optimum) a lot. If you have a reason to trust some points less, sample weights can reduce their influence.

y = rng.normal(loc=0.0, scale=1.0, size=60)
y_outlier = y.copy()
y_outlier[0] = 8.0  # one extreme point

a_grid = np.linspace(-4, 9, 500)

mse_clean = np.mean((y[:, None] - a_grid[None, :]) ** 2, axis=0)
mse_out = np.mean((y_outlier[:, None] - a_grid[None, :]) ** 2, axis=0)

mu_clean = float(np.mean(y))
mu_out = float(np.mean(y_outlier))

mse_at_mu_clean = float(np.mean((y - mu_clean) ** 2))
mse_at_mu_out = float(np.mean((y_outlier - mu_out) ** 2))

# Downweight the outlier (one simple way to reduce its influence)
w_down = np.ones_like(y_outlier)
w_down[0] = 0.05

wmse_down = (
    np.sum(((y_outlier[:, None] - a_grid[None, :]) ** 2) * w_down[:, None], axis=0) / float(np.sum(w_down))
)
mu_w = float(np.sum(w_down * y_outlier) / np.sum(w_down))
wmse_at_mu_w = float(np.sum(((y_outlier - mu_w) ** 2) * w_down) / np.sum(w_down))

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("MSE(a) is minimized at the mean", "Weighted MSE can reduce outlier influence"),
)

# Left: clean data vs the same data with an outlier
fig.add_trace(go.Scatter(x=a_grid, y=mse_clean, mode="lines", name="MSE(a) (clean)"), row=1, col=1)
fig.add_trace(go.Scatter(x=a_grid, y=mse_out, mode="lines", name="MSE(a) (with outlier)"), row=1, col=1)

fig.add_vline(x=mu_clean, line_dash="dash", line_color="rgba(0,0,0,0.45)", row=1, col=1)
fig.add_vline(x=mu_out, line_dash="dash", line_color="rgba(0,0,0,0.45)", row=1, col=1)

fig.add_annotation(
    x=mu_clean,
    y=mse_at_mu_clean,
    text=f"mean={mu_clean:.2f}<br>min MSE=Var(y)≈{mse_at_mu_clean:.2f}",
    showarrow=True,
    yshift=10,
    row=1,
    col=1,
)
fig.add_annotation(
    x=mu_out,
    y=mse_at_mu_out,
    text=f"mean={mu_out:.2f}",
    showarrow=True,
    yshift=10,
    row=1,
    col=1,
)

# Right: the outlier dataset with uniform weights vs a small outlier weight
fig.add_trace(
    go.Scatter(
        x=a_grid,
        y=mse_out,
        mode="lines",
        name="MSE(a) (uniform weights)",
        showlegend=False,
    ),
    row=1,
    col=2,
)
fig.add_trace(go.Scatter(x=a_grid, y=wmse_down, mode="lines", name="WMSE(a) (downweight outlier)"), row=1, col=2)

fig.add_vline(x=mu_out, line_dash="dash", line_color="rgba(0,0,0,0.45)", row=1, col=2)
fig.add_vline(x=mu_w, line_dash="dash", line_color="rgba(0,0,0,0.45)", row=1, col=2)

fig.add_annotation(
    x=mu_out,
    y=mse_at_mu_out,
    text=f"uniform mean={mu_out:.2f}",
    showarrow=True,
    yshift=10,
    row=1,
    col=2,
)
fig.add_annotation(
    x=mu_w,
    y=wmse_at_mu_w,
    text=f"weighted mean={mu_w:.2f}",
    showarrow=True,
    yshift=10,
    row=1,
    col=2,
)

fig.update_xaxes(title_text="constant prediction a", row=1, col=1)
fig.update_yaxes(title_text="loss", row=1, col=1)
fig.update_xaxes(title_text="constant prediction a", row=1, col=2)
fig.update_yaxes(title_text="loss", row=1, col=2)

fig.update_layout(
    title="Constant predictor: mean (and weighted mean) minimize squared error",
    legend=dict(orientation="h"),
)
fig.show()

6) Using MSE to optimize a model: simple linear regression#

Model#

With one feature \(x\), a linear model predicts:

\[ \hat{y}_i = b_0 + b_1 x_i \]

Objective#

Minimize MSE over parameters \((b_0, b_1)\):

\[ J(b_0, b_1) = \frac{1}{n}\sum_{i=1}^n \big(y_i - (b_0 + b_1 x_i)\big)^2 \]

Gradients#

Let \(\hat{y}_i=b_0+b_1x_i\) and residual \(r_i=y_i-\hat{y}_i\).

\[ \frac{\partial J}{\partial b_0} = -\frac{2}{n}\sum_{i=1}^n r_i \qquad \frac{\partial J}{\partial b_1} = -\frac{2}{n}\sum_{i=1}^n x_i r_i \]

Gradient descent#

Update parameters with learning rate \(\eta\):

\[ (b_0, b_1) \leftarrow (b_0, b_1) - \eta\,\nabla J(b_0, b_1) \]

For linear regression, there is also a closed-form solution (ordinary least squares), but gradient descent makes it concrete how MSE defines the optimization landscape.
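For reference, setting both gradients to zero yields the normal equations \((X^\top X)\,b = X^\top y\). A minimal sketch on synthetic data (the code below uses np.linalg.lstsq for comparison, which is the numerically preferred route):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=100)
y = 1.5 + 2.0 * x + rng.normal(0, 0.5, size=100)

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# Normal equations: (X^T X) b = X^T y
b = np.linalg.solve(X.T @ X, X.T @ y)
print(b)  # close to the true params (1.5, 2.0), up to noise

# Matches lstsq, which is more robust when X is ill-conditioned
assert np.allclose(b, np.linalg.lstsq(X, y, rcond=None)[0])
```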

# Synthetic dataset
n = 150
x = rng.uniform(-3, 3, size=n)

b0_true = 1.5
b1_true = 2.0
noise = rng.normal(0, 0.8, size=n)

y = b0_true + b1_true * x + noise

fig = px.scatter(x=x, y=y, title="Synthetic regression data", labels={"x": "x", "y": "y"})
fig.show()
def mse_for_line(x, y, b0, b1):
    y_hat = b0 + b1 * x
    r = y - y_hat
    return float(np.mean(r**2))


def mse_gradients_for_line(x, y, b0, b1):
    'Return (mse, dJ/db0, dJ/db1) for y_hat = b0 + b1 x.'

    y_hat = b0 + b1 * x
    r = y - y_hat
    n = x.shape[0]

    mse = float(np.mean(r**2))
    db0 = float((-2.0 / n) * np.sum(r))
    db1 = float((-2.0 / n) * np.sum(x * r))
    return mse, db0, db1


def fit_line_gd(x, y, *, lr=0.05, n_steps=200):
    b0, b1 = 0.0, 0.0

    history = {
        "step": [],
        "b0": [],
        "b1": [],
        "mse": [],
    }

    for step in range(n_steps):
        mse, db0, db1 = mse_gradients_for_line(x, y, b0, b1)

        history["step"].append(step)
        history["b0"].append(b0)
        history["b1"].append(b1)
        history["mse"].append(mse)

        b0 -= lr * db0
        b1 -= lr * db1

    return (b0, b1), history
(b0_gd, b1_gd), hist = fit_line_gd(x, y, lr=0.05, n_steps=200)

# Closed-form OLS (for comparison)
X = np.column_stack([np.ones_like(x), x])
b0_ols, b1_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"true params: b0={b0_true:.3f}, b1={b1_true:.3f}")
print(f"GD   params: b0={b0_gd:.3f}, b1={b1_gd:.3f}, MSE={mse_for_line(x, y, b0_gd, b1_gd):.4f}")
print(f"OLS  params: b0={b0_ols:.3f}, b1={b1_ols:.3f}, MSE={mse_for_line(x, y, b0_ols, b1_ols):.4f}")

fig = go.Figure()
fig.add_trace(go.Scatter(x=hist["step"], y=hist["mse"], mode="lines", name="MSE"))
fig.update_layout(title="Gradient descent on MSE", xaxis_title="step", yaxis_title="MSE")
fig.show()

# Fit line
x_line = np.linspace(x.min(), x.max(), 200)

y_hat_gd = b0_gd + b1_gd * x_line
y_hat_ols = b0_ols + b1_ols * x_line

fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode="markers", name="data", marker=dict(size=6, opacity=0.75)))
fig.add_trace(go.Scatter(x=x_line, y=y_hat_gd, mode="lines", name="GD fit"))
fig.add_trace(go.Scatter(x=x_line, y=y_hat_ols, mode="lines", name="OLS fit", line=dict(dash="dash")))
fig.update_layout(title="Fitted line", xaxis_title="x", yaxis_title="y")
fig.show()
true params: b0=1.500, b1=2.000
GD   params: b0=1.475, b1=1.966, MSE=0.6438
OLS  params: b0=1.475, b1=1.966, MSE=0.6438
# Visualize the MSE surface over (b0, b1) with the GD path

b0_grid = np.linspace(b0_ols - 2.0, b0_ols + 2.0, 160)
b1_grid = np.linspace(b1_ols - 2.0, b1_ols + 2.0, 160)

B0, B1 = np.meshgrid(b0_grid, b1_grid)
residuals = y[None, None, :] - (B0[:, :, None] + B1[:, :, None] * x[None, None, :])
Z = np.mean(residuals**2, axis=2)

stride = max(1, len(hist["step"]) // 120)
b0_path = np.array(hist["b0"])[::stride]
b1_path = np.array(hist["b1"])[::stride]

fig = go.Figure()
fig.add_trace(
    go.Contour(
        x=b0_grid,
        y=b1_grid,
        z=Z,
        contours_coloring="heatmap",
        colorbar=dict(title="MSE"),
    )
)
fig.add_trace(
    go.Scatter(
        x=b0_path,
        y=b1_path,
        mode="lines+markers",
        name="GD path",
        marker=dict(size=4, color="black"),
        line=dict(color="black"),
    )
)
fig.add_trace(
    go.Scatter(
        x=[b0_ols],
        y=[b1_ols],
        mode="markers",
        name="OLS optimum",
        marker=dict(symbol="x", size=10, color="white", line=dict(color="black", width=2)),
    )
)

fig.update_layout(
    title="MSE landscape for a line: convex bowl + gradient descent trajectory",
    xaxis_title="b0 (intercept)",
    yaxis_title="b1 (slope)",
)
fig.show()

7) Practical usage (scikit-learn)#

In practice, MSE is most often used to evaluate regression models on a validation/test set:

  • Lower is better.

  • Because it has squared units, also report \(\mathrm{RMSE}=\sqrt{\mathrm{MSE}}\) for interpretability.

  • Use sample_weight when some samples matter more or have different noise levels.

  • For multioutput regression, use multioutput to get per-target values or an aggregate.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X = x.reshape(-1, 1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)

model = LinearRegression()
model.fit(X_tr, y_tr)
y_pred_te = model.predict(X_te)

mse_te = mean_squared_error(y_te, y_pred_te)
rmse_te = float(np.sqrt(mse_te))

# Example: weighted evaluation (here we weight samples with large |x| more)
w_te = 0.5 + np.abs(X_te[:, 0])
mse_te_w = mean_squared_error(y_te, y_pred_te, sample_weight=w_te)

print(f"Test MSE:  {mse_te:.4f}")
print(f"Test RMSE: {rmse_te:.4f}")
print(f"Weighted Test MSE (w = 0.5 + |x|): {mse_te_w:.4f}")

min_v = float(min(y_te.min(), y_pred_te.min()))
max_v = float(max(y_te.max(), y_pred_te.max()))

fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=y_te,
        y=y_pred_te,
        mode="markers",
        name="pred vs true",
        marker=dict(size=7, opacity=0.8),
    )
)
fig.add_trace(go.Scatter(x=[min_v, max_v], y=[min_v, max_v], mode="lines", name="ideal", line=dict(dash="dash")))
fig.update_layout(
    title=f"Test set predictions (MSE={mse_te:.3f}, RMSE={rmse_te:.3f})",
    xaxis_title="y_true",
    yaxis_title="y_pred",
)
fig.show()
Test MSE:  0.9581
Test RMSE: 0.9788
Weighted Test MSE (w = 0.5 + |x|): 0.9059

8) Pros, cons, and when to use MSE#

Pros#

  • Simple and widely used baseline for regression.

  • Differentiable and often convex (e.g. linear regression) → friendly for optimization.

  • Strongly penalizes large errors (useful when big misses are particularly costly).

  • Has a clear statistical interpretation: minimizing MSE corresponds to maximum likelihood under i.i.d. Gaussian noise.

Cons / pitfalls#

  • Outlier sensitive: a few large residuals can dominate the average.

  • Scale dependent: MSE values depend on the units of \(y\) (hard to compare across targets without normalization).

  • Squared units: less interpretable than RMSE or MAE.

  • Optimizing MSE targets the conditional mean; if you care about medians/quantiles or heavy-tailed noise, MAE / pinball loss / Huber can be better.

Good use cases#

  • Standard regression tasks where errors are roughly symmetric and not extremely heavy-tailed.

  • When large errors should be penalized more than small ones.

  • As a training loss for linear models / neural nets when Gaussian noise is a reasonable assumption.

9) Exercises#

  1. Implement root_mean_squared_error_np and verify it matches sqrt(MSE).

  2. Create a dataset with 1% extreme outliers. Compare how MSE vs MAE changes.

  3. Extend fit_line_gd to multiple features: \(\hat{y}=b+Xw\).

  4. Derive the closed-form OLS solution and compare it to gradient descent.

References#

  • scikit-learn docs: https://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error

  • Hastie, Tibshirani, Friedman — The Elements of Statistical Learning (least squares and loss functions)